INSTRUCTION MANUAL for ABaCUS (Analysis of Blake's Conjecture Using
Simulations) by Arlin Stoltzfus and David Spencer. Manual version 0.48, 5
July 1994 (by A. Stoltzfus)
==========================================================================
CONTENTS:
==========================================================================
0. HOW TO USE THIS INSTRUCTION MANUAL
I. ABOUT ABACUS
A. BASIC DESCRIPTION
B. HARDWARE AND SOFTWARE REQUIREMENTS FOR THE PRE-COMPILED VERSION
C. COMPILING THE ABACUS CODE FOR ANOTHER ENVIRONMENT
D. CITING ABACUS IN PUBLISHED WORK; PROPRIETARY CLAIMS
II. TUTORIAL: AN ANALYSIS OF CYTOCHROME C DATA
A. STAGE 1: PREPARATION OF THE CYTOCHROME C DATA
B. STAGE 2: CREATING THE NECESSARY DATA FILES
C. STAGE 3: ANALYZING CORRESPONDENCES WITH THE CYTOCHROME C DATA SET
III. GENERAL STEPWISE INSTRUCTIONS
A. STAGE 1: COLLATE THE DATA PRIOR TO USING ABACUS
B. STAGE 2: ENTER THE OBSERVED DATA AND SAVE THEM TO FILES
C. STAGE 3: EVALUATE CORRESPONDENCES
IV. DETAILED COMMENTS
A. INTRONS, EXONS AND INFERRED ANCESTRAL EXONS
B. ARRAYS: CREATING, CONVERTING, SAVING AND LOADING
C. BE CAREFUL WHEN ENTERING DATA
D. LOADING ATOMIC COORDINATES FROM A PDB FILE
E. GENERATING REFERENCE GENE DATA
F. SCORING CORRESPONDENCES
G. EVALUATING THE SIGNIFICANCE OF A CORRESPONDENCE
H. SAVING RESULTS; FURTHER ANALYSIS OF SCORES; etc
I. PLOTTING DIAGONAL PLOTS AND EXON PLOTS
V. ADDITIONAL DETAILS
A. HARD LIMITS ON PARAMETERS
B. THE RANDOM NUMBER GENERATOR
C. EXPLANATION OF THE SETTINGS MENU
D. HOW TO CONTACT THE PDB
VI. REFERENCES
==========================================================================
0. HOW TO USE THIS INSTRUCTION MANUAL
==========================================================================
0.A. IF YOU DON'T HAVE AN EXECUTABLE PROGRAM. Try out the DOS or Sun
executable or read section I below to be sure that ABaCUS can solve the
type of problem that you are interested in. If so, read section I.C.
below, then open the header file "abacus.h" with a text editor. Read the
instructions therein, make the necessary changes, and proceed. You may
also want to check out section V.A. to help in tailoring ABaCUS to your
needs.
0.B. IF YOU ALREADY HAVE AN EXECUTABLE PROGRAM. First, read section I
(short). Next, be sure that the executable ("abacus.exe" in DOS and
"abacus" in SunOS) and the data file "pdb1ccrs.txt" are on your hard drive,
in the same directory. For DOS users who wish to use graphics, be sure to
include the graphics interface file (with the ".bgi" extension) appropriate
for your hardware. Then read and perform the tutorial exercises, using
ABaCUS. Plan on spending 30-60 minutes on the tutorial. This should be
enough to familiarize you with the steps involved in preparing and
analyzing data.
0.C. IF YOU'RE NOT SURE ABOUT SOMETHING. This document represents a large
amount of work devoted to explaining how ABaCUS works and how to use it.
Please consult this manual for explanations of how data are handled and how
operations are carried out. For questions about the meaning of statistical
results of simulations, ask your local statistical consultant. For
bit-twiddly questions, consult the code, which is heavily commented. For
questions about the interpretation of results in the context of the
evolution of introns, a good place to start is the general review by
Doolittle (1987). Also, see Gilbert and Glynias (1994) and Stoltzfus, et
al. (1994). As a last recourse, ask the authors for help, preferably by
E-mail, at one of the addresses listed below. If you are carrying out a
research project involving correspondences between split gene structure and
protein structure, we would be happy to hear about it, even if you don't
have any questions, and even if you don't find any correspondences.
Dr. Arlin Stoltzfus and Dr. David Spencer
Canadian Institute for Advanced Research
Program in Evolutionary Biology
Department of Biochemistry
Dalhousie University
Halifax, Nova Scotia B3H 4H7 CANADA
internet: arlin@ac.dal.ca
phone: 902-494-3569
facsimile: 902-494-1355
==========================================================================
I. ABOUT ABACUS
==========================================================================
I.A. BASIC DESCRIPTION
ABaCUS is a no-frills program to investigate the significance of the
putative correspondence between exons and units of protein structure. This
type of analysis takes the form of an attempt to eliminate the reference
hypothesis (sometimes called a "null" hypothesis) that no correspondence
exists. A reference hypothesis in this case consists of a reference model
for random gene structures, and a scoring rule for quantifying
correspondences (in principle, a test could be done by generating random
protein structures instead of random gene structures, but this is
impracticable). ABaCUS creates and reads files containing observed data
supplied by the user, then uses this information to generate reference
genes according to one of several available models. The observed and
reference genes are then scored according to a correspondence rule
designated by the user, and the scores are compared in order to determine
whether the reference hypothesis (i.e., no correspondence) can be rejected.
I.B. HARDWARE AND SOFTWARE REQUIREMENTS FOR THE PRE-COMPILED DOS VERSION
The compiled program "ABaCUS.exe" runs in DOS. The minimal DOS platform is
a 286-based PC-compatible computer with a monochrome monitor. Monochrome
or color graphics are possible (drivers are provided for EGA, VGA, CGA and
Hercules). If you are not sure which driver to use, just include all of
the drivers in the same directory (ABaCUS will automatically use the
correct one). There is also a precompiled SunOS version, which does not
have graphics and thus requires no additional files.
I.C. COMPILING THE ABACUS CODE FOR ANOTHER ENVIRONMENT
All of the important parts of ABaCUS are portable to non-DOS environments.
The graphics portion-- which is available only in the DOS environment, and
is dependent on the Borland graphics library-- is interesting but not
central to the task of hypothesis-testing. An ANSI-C-compliant version of
ABaCUS has been compiled and run in BSD UNIX (using the GNU C compiler
2.4.0 on a Sun running SunOS 4.1.2; also on a NeXT). To compile ABaCUS,
one needs the main code block "abacus.c" and the header file "abacus.h".
All alterations are made within the header file, which contains
instructions for conditional compilation.
If you have gotten an ABaCUS package from an Internet server, the ".readme"
file associated with the package will give further information on
compilation for specific environments.
I.D. CITING ABACUS IN PUBLISHED WORK; PROPRIETARY CLAIMS
A manuscript describing ABaCUS is in preparation (Stoltzfus and Spencer,
1998). For now, please cite "A. Stoltzfus and David Spencer, personal
communication" as the source of ABaCUS, and refer to Stoltzfus, et al.
(1994) for its use in analyzing correspondences.
Because ABaCUS is a scientific application designed to aid in resolving a
biological question, it is available to the general public. The code has
no copyright at present, and may be distributed freely. We encourage
interested biologists to analyze their data using ABaCUS, and to report the
results (whether positive or negative) in trade journals. We would be
delighted to receive a preprint or reprint of any manuscript describing
analyses performed using ABaCUS.
==========================================================================
II. TUTORIAL: AN ANALYSIS OF CYTOCHROME C DATA
==========================================================================
An analysis falls into three stages:
A. gathering and collating observed data;
B. creating data files using ABaCUS;
C. evaluating correspondences using ABaCUS.
The user must supply the data (sequence information) and the tools (e.g.,
an alignment program) to collate it. ABaCUS provides the remaining
accounting and computational tools. Once the data are prepared, analyses
can be carried out in a single session lasting from a few minutes to a few
hours (depending on the complexity of the case and the computing power
available). The operations involved in each stage of a typical analysis
are described in the tutorial and in section III below.
II.A. STAGE 1: PREPARATION OF THE CYTOCHROME C DATA
The data have already been prepared, as follows.
II.A.1. Protein structure. The structure of rice cytochrome C in the file
named "pdb1ccr.ent" was chosen (arbitrarily) from among three cytochrome C
structures at the Brookhaven PDB that have a very fine resolution, of 1.5
Angstroms. In addition to atomic coordinates, the file pdb1ccr.ent
contains a list of the boundaries of alpha-helices (there are no
beta-strands in cytochrome C).
II.A.2. Intron-containing sequences. Kemmerer, et al. (1991a, 1991b) listed
a total of 5 distinct intron positions found in cytochrome C genes of rice,
Drosophila, Arabidopsis, human, chicken, and mouse. A search for
additional distantly related intron-containing sequences in GenBank yielded
one gene, from Aspergillus nidulans, containing two intron positions
(Raitt, et al., 1994). Alignments of the inferred amino acid sequences of
all of these intron-containing genes indicate that there are a total of 6
distinct intron positions, which can be represented in a minimal set of
four sequences, from Arabidopsis, rice, chicken, and Aspergillus. It is
possible that this set does not represent all currently known distinct
intron positions, since there are literally hundreds of cytochrome C
sequences in GenBank, and my search procedure did not involve screening
each entry for potentially novel intron positions.
II.A.3. Alignment with reference protein. The complete rice sequence
(corresponding to the crystal structure) contains 111 residues, but only
the last 103 residues align with other cytochrome C sequences. Therefore,
a text editor was used to delete data for the first 8 residues: the
resulting shortened file is called "pdb1ccrS.TXT". This file has been
included with the ABaCUS package. The positions of the 6 introns relative
to the canonical-length sequence of rice cytochrome C are:
     source          intron
     taxon           position
     -----------     --------
     Arabidopsis     12-0
     rice            29-1
     animals         56-1
     Arab., Asp.     65-0
     rice            74-0
     Aspergillus     96-2
The positions of alpha-helices relative to the canonical-length sequence of
rice cytochrome C are:
                     left & right
     structure       boundaries (inclusive)
     ---------       ----------------------
     helix1           2 to 14
     helix2          49 to 55
     helix3          60 to 69
     helix4          70 to 75
     helix5          87 to 103
II.B. STAGE 2: CREATING THE NECESSARY DATA FILES
II.B.1. Enter the observed intron positions. Enter the size of the gene
as 103 codons and the number of introns as 6. Then input the numbers in
the table of intron positions above. When entering the intron positions,
separate the codon and phase using one or a few spaces. Use the "v=VIEW"
command to see the intron positions. The console should look like this:
OBS: 33 85 166 192 219 287
SCORE: 0.0 0.0 0.0 0.0 0.0 0.0
This means that the first intron is after the 33rd coding nucleotide of the
canonical-length mRNA, that is, the 33rd inter-nucleotide site (an mRNA of
N nucleotides has N-1 possible intron positions, or inter-nucleotide
sites). If the intron positions entered were correct, then save them to a
file named "cytobs.int" (short for "cytochromeC observed introns").
If the intron positions entered were correct, and the number of codons
entered was correct, then ABaCUS has also created a correct set of exon
sizes. The set should look like this:
OBS: 11 18 27 8 9 23 7
SCORE: 0.0 0.0 0.0 0.0 0.0 0.0 0.0
This means that the first 11 residues of the protein are assigned to the
first exon, the next 18 to the second exon, and so on. Notice that there
are 7 exon sizes for 6 intron positions, and that exon sizes are in codons
(or residues), while intron positions are on a nucleotide scale. If the
exon sizes are correct, save the exon sizes to a file called "cythyp.exn"
(short for "cytochrome hypothetical exons").
II.B.2. Enter the boundaries of the 5 helices.
Go to the "d=DISCRETE" elements submenu, and choose "e=ENTER". Enter the
gene size as 103 codons, and enter the left and right boundaries of helix1,
using the numbers in the table above. That is, enter 2 and 14 for the left
and right boundaries of helix1. Continue ("c=continue") entering elements
until all five have been entered. Then choose "d=done". Choose "v=view"
to view the array, which will be a string of 1's and 0's. If the secondary
structure elements were entered correctly, the bottom of the display should
show the following message:
The average score per position is 0.500000.
This is the average score for positions in the array. In this case, it
happens (quite by chance!) that exactly half of the 308 possible intron
positions (103 codons --> 309 bp --> 308 inter-nucleotide sites) are
internal to structural elements. If the elements were entered correctly,
save this array to a file named "cytsec1.arr".
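
The bookkeeping behind this array can be checked independently. The
following Python sketch is not part of ABaCUS; the function name
binary_array and the variable names are made up for illustration. It
assumes, following the tutorial text, that a site counts as internal only
when the nucleotides on both sides of it belong to the same element.

```python
# Illustrative sketch only: not taken from the ABaCUS source code.
HELICES = [(2, 14), (49, 55), (60, 69), (70, 75), (87, 103)]  # inclusive codons
N_CODONS = 103

def binary_array(elements, n_codons):
    """Return 0/1 scores for inter-nucleotide sites 1 .. 3*n_codons - 1."""
    n_sites = 3 * n_codons - 1
    scores = [0] * (n_sites + 1)        # index 0 unused; site k follows nucleotide k
    for left, right in elements:
        first_nt = 3 * (left - 1) + 1   # first nucleotide of the element
        last_nt = 3 * right             # last nucleotide of the element
        # site k is internal when nucleotides k and k+1 are both in the element
        for site in range(first_nt, last_nt):
            scores[site] = 1
    return scores[1:]

arr = binary_array(HELICES, N_CODONS)
print(len(arr), sum(arr), sum(arr) / len(arr))
```

For the five helices listed in the tutorial, this counts 154 internal sites
out of 308, which agrees with the average of 0.500000 reported by ABaCUS.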
II.B.3. Convert the maximum array score.
Now choose "c=convert" to convert the array to a new maximum score. Enter
9999 for the maximum score, and save the converted array to a file called
"cytsecm.arr".
The array created in step II.B.2 had a maximum score of 1, and could be
used to give binary scores to introns: that is, 0 is assigned to intron
positions between structural elements, and 1 is assigned to intron
positions within structural elements. Converting the array to a high
maximum score creates a graduated array in which each number in the array
is the distance in bp to the nearest element-free region. Recall that the
first helix began at residue 2. Therefore, the first three intron
positions, 1-1, 1-2, and 2-0, fall in an inter-helix region, whereas the
next introns, 2-1, 2-2, 3-0, etc are successively more deeply embedded in
helix1. The first 65 numbers (representing the inter-nucleotide sites in
the first 22 codons) should look like this:
0 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 19 18 17 16 15 14 13
12 11 10 9 8 7 6 5 4 3 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
The distance scores continue to increase until site 8-2 (19 nucleotides
from the carboxy end of helix1), then they decrease as the carboxy end of
helix1 is approached. The last site that can be considered "inside" helix1
is 14-2 (1 nt from the carboxy end of helix1); the next site, 15-0, is
"outside" helix1, and has a score of 0.
Although there are some circumstances in which one wishes to limit the
maximum score (e.g., to 9 or 15), one usually wants a completely graded
array, and 9999 is sufficiently high to ensure that the maximum achievable
score will be reached in any gene (unless it is > 19998 bp in length!).
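
The conversion can be pictured with the following Python sketch. It is an
illustration under stated assumptions, not the ABaCUS code: the name
graded_array is hypothetical, elements are assumed not to overlap, and the
end of the gene is treated like an element-free boundary (this section of
the manual does not say how ABaCUS handles an element that runs to the end
of the gene).

```python
# Illustrative sketch only: not taken from the ABaCUS source code.
HELICES = [(2, 14), (49, 55), (60, 69), (70, 75), (87, 103)]  # inclusive codons

def graded_array(elements, n_codons, max_score=9999):
    """Score each site by its distance (bp) to the nearest element-free site."""
    n_sites = 3 * n_codons - 1
    scores = [0] * (n_sites + 1)            # index 0 unused
    for left, right in elements:
        first_site = 3 * (left - 1) + 1     # first site internal to the element
        last_site = 3 * right - 1           # last site internal to the element
        for site in range(first_site, last_site + 1):
            depth = min(site - first_site + 1, last_site - site + 1)
            scores[site] = min(depth, max_score)   # cap at the chosen maximum
    return scores[1:]

graded = graded_array(HELICES, 103)
print(" ".join(str(s) for s in graded[:65]))
```

With a maximum of 9999 the first 65 values match the listing above; with a
maximum of 1 the same function gives back the binary array of step II.B.2.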
II.B.4. Load the crystal structure of rice cytochrome C.
Go to the "a=ATOMIC COORDINATES" submenu and choose "l=LOAD". Enter the
name of the file, which is "pdb1ccrs.txt". After the file has been read, a
warning message will appear, indicating that the numbering in the file was
non-consecutive. This does not necessarily mean that the file has been read
incorrectly-- for instance, the chicken TPI crystal structure (PDB file
1tim) has no residue #3 (the numbering in the file is 1, 2, 4, 5, 6 . . .
246, 247, 248, but there are really only 247 residues). In the case of
"pdb1ccrs.txt", the first 8 residues of pdb1ccr.ent were deleted, and the
103 residues in "pdb1ccrs.txt" are thus numbered 9-111 instead of 1-103.
The atomic coordinates maintained in memory by ABaCUS have the correct
numbers, 1-103, because ABaCUS assigns its own, consecutive, numbering
system as it reads the file.
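
The renumbering idea can be sketched in Python as follows. This is only an
illustration: the function names read_calpha and atom_line are hypothetical,
the coordinates are invented, and the parsing relies on the fixed column
layout of PDB ATOM records (atom name in columns 13-16, residue number in
columns 23-26, x, y and z in columns 31-54) rather than on anything known
about how ABaCUS itself reads the file. Alternate locations and multiple
chains are ignored.

```python
# Illustrative sketch only: not taken from the ABaCUS source code.
def atom_line(serial, name, res, res_seq, x, y, z):
    """Build a minimal fixed-column ATOM record (made-up data for the demo)."""
    return "ATOM  %5d %-4s %3s A%4d    %8.3f%8.3f%8.3f" % (
        serial, name, res, res_seq, x, y, z)

def read_calpha(lines):
    """Keep C-alpha atoms only and number them consecutively from 1."""
    coords = []
    file_numbers = []
    for line in lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            file_numbers.append(int(line[22:26]))
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            coords.append((len(coords) + 1,) + xyz)   # new number, then x, y, z
    # True only when the file's own numbering is already 1, 2, 3, ...
    numbering_matches = file_numbers == list(range(1, len(coords) + 1))
    return coords, numbering_matches

demo = [
    atom_line(1, "N", "ALA", 9, 0.0, 0.0, 0.0),
    atom_line(2, "CA", "ALA", 9, 1.0, 2.0, 3.0),
    atom_line(3, "CA", "GLY", 10, 4.0, 5.0, 6.0),
    atom_line(4, "CA", "SER", 12, 7.0, 8.0, 9.0),
]
coords, numbering_matches = read_calpha(demo)
print(coords, numbering_matches)
```

In the demo the file numbers 9, 10 and 12 become residues 1, 2 and 3, and
the mismatch flag corresponds to the kind of warning described above.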
Now quit ABaCUS, and find the file "calpha.xyz". This file, which was
written automatically by ABaCUS when "pdb1ccrs.txt" was read, contains only
the C-alpha coordinate lines from the original file, and thus the file is
10-50 times smaller than the original. Change the name of the file from
"calpha.xyz" to "1ccrsca.xyz". Since we know that the original file has
been read correctly, we can use "1ccrsca.xyz" in place of it, to save
space. Every time ABaCUS loads a crystal structure, it creates a file
called "calpha.xyz" with the C-alpha data. This file can be used to check
whether the crystal structure has been read correctly and, if so, it can be
used in place of the original PDB file.
II.C. STAGE 3: ANALYZING CORRESPONDENCES WITH THE CYTOCHROME C DATA SET
Below are instructions for testing 3 hypotheses about the cytochrome C
data. Each hypothesis involves a choice of a scoring rule and a reference
gene model. The general form of each hypothesis will be that the observed
gene data do not correspond (as quantified using the chosen scoring rule)
to protein structure better than random introns or exons (generated by the
reference model).
II.C.1. Load the observed data files. Restart ABaCUS, and load the exon
data file, "cythyp.exn", using the "l=LOAD" command in the main menu; load
the array "cytsecm.arr" using the equivalent command in the "d=DISCRETE"
submenu; and load the atomic coordinates in "1ccrsca.xyz" (or in
"pdb1ccrs.txt", if you prefer to use the original file) using "l=LOAD" in
the "a=ATOMIC COORDINATES" submenu. We'll load the intron position data
later.
II.C.2. Generate reference genes.
II.C.2.a. Generate reference intron positions.
Go to the "r=REFERENCE" genes submenu and choose "u=UNIFORM" intron
positions. Unless you loaded the intron position file in step 1 above,
ABaCUS gives an error message to the effect that reference intron positions
cannot be generated unless observed intron positions have been loaded. The
random reference gene data must reflect the properties of the observed gene
data-- same number of introns, same gene length-- and therefore ABaCUS
requires observed intron positions before it will generate reference intron
positions. Exons are treated separately, but they follow the same rules.
Go back to the main menu and load the intron position data from
"cytobs.int" then return to the "r=REFERENCE" genes submenu. Using
"u=UNIFORM", generate 1000 sets of uniform random intron positions, with
the minimum inter-intronic distance set to 1 bp. Specifically, this model
of random intron positions creates 1000 sets, each with 6 non-identical
positions randomly drawn with uniform probabilities per inter-nucleotide
site. This is the reference model for randomly placed introns.
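
A reference model of this kind might be sketched in Python as shown here.
The function name uniform_introns and the use of rejection sampling are
assumptions made for illustration; the method actually used by ABaCUS, and
its random number generator (see section V.B.), may differ.

```python
# Illustrative sketch only: not taken from the ABaCUS source code.
import random

def uniform_introns(n_sites, n_introns, min_dist, rng):
    """Draw distinct sites uniformly; redraw until the spacing rule is met."""
    while True:
        positions = sorted(rng.sample(range(1, n_sites + 1), n_introns))
        gaps = [b - a for a, b in zip(positions, positions[1:])]
        if all(gap >= min_dist for gap in gaps):
            return positions

rng = random.Random(1996)          # fixed seed so that a run can be repeated
N_SITES = 3 * 103 - 1              # 308 inter-nucleotide sites in cytochrome C
reference_sets = [uniform_introns(N_SITES, 6, 1, rng) for _ in range(1000)]
print(reference_sets[0])
```

With the minimum distance set to 1 bp, every draw of 6 distinct sites is
accepted. A large minimum distance would make the loop reject most draws,
so a sketch like this is only practical when the spacing rule is easy to
satisfy.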
II.C.2.b. Generate reference exon sizes.
Go to the "r=REFERENCE" submenu and choose "p=PERMUTE" exon sizes. Ask for
2 sets of permuted exon sizes. Go back to the main menu and view the exon
sizes on the console. It is easy to see that each set of exons contains
exactly the same sizes as the other sets-- the only difference is in the
order. Now generate another two sets and view them by returning to the
main menu and choosing "v=VIEW". Notice that there are not 4 reference
sets, but only 2. This is because ABaCUS erases the previous list of 2
sets and replaces it with the new list of 2 sets. ABaCUS can only keep ONE
list of random exons in memory, and the list is erased and rewritten every
time reference genes are generated. Intron positions are stored
separately, but they follow the same rule.
Go back to the reference genes submenu and generate 1000 sets of randomly
permuted reference exon sizes. This is the reference model for exon sizes.
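
The permutation model is simple enough to sketch in a few lines of Python.
The function name permuted_sets is hypothetical and the shuffling routine
is Python's own, not the one used by ABaCUS.

```python
# Illustrative sketch only: not taken from the ABaCUS source code.
import random

OBSERVED_EXONS = [11, 18, 27, 8, 9, 23, 7]   # exon sizes from step II.B.1

def permuted_sets(sizes, n_sets, rng):
    """Return n_sets random reorderings of the observed exon sizes."""
    result = []
    for _ in range(n_sets):
        shuffled = list(sizes)     # copy, so the observed order is untouched
        rng.shuffle(shuffled)
        result.append(shuffled)
    return result

rng = random.Random(7)
reference_exons = permuted_sets(OBSERVED_EXONS, 1000, rng)
print(reference_exons[0], reference_exons[1])
```

Every reference set contains exactly the observed sizes and therefore still
sums to 103 codons; only the order changes.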
II.C.3. Assign scores and evaluate the reference hypothesis.
II.C.3.a. Assign scores and evaluate centrality of intron positions.
This hypothesis, which we could call HC, for hypothesis regarding
centrality, is that the intron positions do not correspond to central
locations in the three-dimensional crystal structure better than randomly
placed introns. The alternative is that introns tend to correspond to
positions at the center of the protein. To carry out this test, we need
observed intron positions, randomly placed intron positions, a crystal
structure, and a method of measuring centrality. The first three things
are already taken care of. Now all we need to do is measure the centrality
of the observed and random intron positions, compare them, and draw a
conclusion.
Go to the "a=ATOMIC COORDINATES" menu and choose "c=CENTRALITY". Indicate
that cytochrome C has only a single globular domain, and choose rule #4
(this is the most logical rule for centrality; the other rules are not
generally useful). ABaCUS will assign centrality scores to all sets of
observed and reference introns, using the crystal structure in memory. Now
return to the main menu, choose "t=TEST" and examine the results (add a
comment at the prompt, if desired). Can the reference hypothesis, HC, be
excluded?
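
To give a feel for what a centrality score measures, here is a rough Python
sketch. It is not rule #4, whose details are not given in this part of the
manual. It assumes that every C-alpha has equal weight, that the center is
the plain average of the C-alpha coordinates, and that each intron is
scored through the C-alpha of the codon named in its codon-phase notation.
The function names and the toy coordinates are invented.

```python
# Illustrative sketch only: not taken from the ABaCUS source code.
import math

def centroid(coords):
    """Unweighted mean of a list of (x, y, z) points."""
    n = len(coords)
    return tuple(sum(point[i] for point in coords) / n for i in range(3))

def distance(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def centrality_score(intron_sites, coords):
    """Average distance from each intron's codon C-alpha to the centroid."""
    center = centroid(coords)
    total = 0.0
    for site in intron_sites:
        codon = site // 3 + 1                 # inverse of 3 * (codon - 1) + phase
        total += distance(coords[codon - 1], center)
    return total / len(intron_sites)

toy_coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (4.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
toy_introns = [3, 9]                          # sites 2-0 and 4-0 on the bp scale
print(centrality_score(toy_introns, toy_coords))
```

In this toy case the centroid is at x = 3, the two introns map to residues
2 and 4, and the score is the average of the distances 1 and 3.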
II.C.3.b. Assign scores and evaluate avoidance of secondary structures.
The second hypothesis, HAS, is that the intron positions do not tend to
avoid secondary structural elements better than randomly placed intron
positions. The alternative is that intron positions tend to fall between
secondary structures, or at least very close to their ends. The observed
and random intron positions have already been generated (they are still in
memory from the previous test). The scoring rule to be used in this test
consists of the scores in the array "cytsecm.arr".
Go to the "d=DISCRETE ELEMENTS" submenu, and choose "a=ASSIGN" to assign
scores to the intron positions using the scoring array in memory. Since
the array created earlier holds the distance in bp from each potential
intron position to the nearest inter-element boundary, this is the score
that the introns will receive. Return to the main menu to finish the test
by choosing "t=TEST".
At this point, take a break to notice several things about scoring rules.
First, notice that the score assigned to a gene is the average of the
constituent exon (or intron) scores. This is true for all of the scoring
rules used by ABaCUS. Second, in all of the scoring rules used by ABaCUS,
a lower score indicates a better correspondence. For centrality, a low
score means greater proximity to the center (the center of mass, to be
exact) of the protein; for avoidance of secondary structure, a low score
means that the distance to the nearest inter-element region is small-- the
introns are within, or close to, inter-element regions.
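
The two points above (a gene score is an average, and lower is better)
suggest the following Python sketch of a comparison between observed and
reference scores. The statistic shown, the fraction of reference sets that
score as well as or better than the observed set, is an assumption made for
illustration; the exact output of "t=TEST" is described elsewhere in this
manual. All names and numbers in the sketch are invented.

```python
# Illustrative sketch only: not taken from the ABaCUS source code.
def gene_score(sites, score_array):
    """Average the array scores of a set of intron sites (sites count from 1)."""
    return sum(score_array[site - 1] for site in sites) / len(sites)

def fraction_as_good(observed, reference_scores):
    """Fraction of reference scores that are <= the observed score."""
    hits = sum(1 for score in reference_scores if score <= observed)
    return hits / len(reference_scores)

toy_array = [0, 0, 1, 2, 2, 1, 0, 0]          # scores for sites 1..8
observed_sites = [2, 7]
reference_site_sets = [[3, 4], [1, 8], [4, 5], [2, 6]]

observed = gene_score(observed_sites, toy_array)
reference = [gene_score(s, toy_array) for s in reference_site_sets]
print(observed, reference, fraction_as_good(observed, reference))
```

Here the observed set scores 0.0 and only one of the four toy reference
sets does as well, giving a fraction of 0.25. In a real analysis the
reference list would hold 1000 or more sets.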
Also, notice some things about the ABaCUS environment. The same list of
1000 sets of reference intron positions was used in two different tests.
This is perfectly valid, and is actually preferable to generating separate
sets for each test. The sets of intron positions stayed in memory, but the
scores changed when a new scoring rule was chosen.
II.C.3.c. Assign scores and evaluate the extensity of exon-encoded
peptides.
The third hypothesis, HE, is that the peptides encoded by exons are no less
extended than those encoded by random exons. The alternative is that
exon-encoded peptides tend to be non-extended or compact. The observed
data are already loaded, and the reference model (in this case, random
permutations of the observed order of exon sizes) has already been chosen.
It remains to choose a scoring rule, assign scores to the observed and
reference exons, and evaluate the hypothesis.
Go to the "a=ATOMIC COORDINATES" submenu and choose "e=EXTENSITY" scores.
Assign scores to the exons using rule "r=radius of gyration" (this is, in
our opinion, the most sensible rule for extensity: the other rules are
explained in section IV.F). Now return to the main menu and choose "t=TEST"
to evaluate the hypothesis.
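
For readers unfamiliar with the radius of gyration, the following Python
sketch shows the standard definition applied to exon-encoded peptides. It
assumes equal weights for all C-alpha atoms and uses invented function
names and toy coordinates; the exact form of the rule used by ABaCUS is not
stated in this section.

```python
# Illustrative sketch only: not taken from the ABaCUS source code.
import math

def radius_of_gyration(points):
    """Root-mean-square distance of the points from their centroid."""
    n = len(points)
    center = [sum(p[i] for p in points) / n for i in range(3)]
    squared = sum(sum((p[i] - center[i]) ** 2 for i in range(3)) for p in points)
    return math.sqrt(squared / n)

def extensity_score(exon_sizes, coords):
    """Average radius of gyration over consecutive exon-encoded peptides."""
    scores = []
    start = 0
    for size in exon_sizes:
        scores.append(radius_of_gyration(coords[start:start + size]))
        start += size
    return sum(scores) / len(scores)

toy_coords = [(float(x), 0.0, 0.0) for x in range(6)]   # six residues on a line
print(extensity_score([2, 4], toy_coords))
```

A compact peptide has a small radius of gyration, so a low score again
means a better correspondence, consistent with the other scoring rules.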
Before quitting, take a moment to see how ABaCUS maintains records on past
and current experiments. This information is accessed using the "i=INFO"
command in the main menu. Choose this command, then choose "p=past" to see
the results of the three experiments that have been performed. Now choose
"i=INFO" again and choose "c=current" to see descriptions of the data that
are now in memory. In general, the "i=INFO" functions are useful for
keeping track of what has and has not been done during a session.
When "q=QUIT" is chosen from the main menu, you will be prompted for the name
of a file in which to save the results of the experiments performed. Name
the file "tutorial.sum". The file will contain the information on past
experiments that we viewed above.
===> This is the end of the tutorial. Section III provides generalized
instructions for each of the steps done in the tutorial, and Sections IV
and V provide details.
==========================================================================
III. GENERAL STEPWISE INSTRUCTIONS
==========================================================================
III.A. STAGE 1: COLLATE THE DATA PRIOR TO USING ABACUS
Most of the effort in analyzing gene-protein correspondences will be spent
preparing an observed case for analysis. Plan to devote a large amount of
time to carrying out the following tasks: searching sequence databases to
find known intron-containing sequences, checking the primary research
literature to be sure that intron positions are correctly assigned, and
aligning sequences with each other, as well as with protein structural
elements. The following sequence of steps is recommended:
III.A.1. Choose a protein for which intron-containing genes have been
sequenced, and for which a crystal structure is known.
III.A.2. Obtain a file containing atomic coordinates of the protein from
the PDB. If there are several homologous structures to choose from, pick
the one that is the best characterized (best refinement, most additional
information on structural features).
III.A.3. Make a list of boundaries of secondary structures and other
structural elements. For example, PDB files often include a list of the
boundaries of secondary structural elements.
III.A.4. Search sequence databases to find all the known intron-containing
genes. Align the inferred amino acid sequences with each other and with
the protein whose structure has been determined.
III.A.5. Make a list of all known intron positions in codon-phase notation
relative to the protein whose structure has been determined. That is, for
each intron, write down the corresponding residue number in the protein
(each codon corresponds to a residue in the reference protein) and its
phase (0, 1 or 2). An intron between codons 59 and 60 is 60-0 (codon 60,
phase 0) in the notation of Dibb & Newman (1989).
III.A.6. If an analysis of extensity is to be done, make a list of
inferred ancestral intron positions. This list will be the same as the
list of observed intron positions unless there are intron positions that
are not separated by the first nucleotide of any codon (e.g., 29-1 and
29-2, or 29-1 and 30-0), or unless an "intron sliding" assumption is made
on the basis of some looser criterion (for example, see Gilbert & Glynias,
1994). For further explanation, read the entire section IV.A., entitled
INTRONS, EXONS AND INFERRED ANCESTRAL EXONS.
Before starting ABaCUS, double-check that all positional data are numbered
according to the same codon/residue numbering scheme, based on a multiple
sequence alignment. For example, suppose that I am using the atomic
coordinates and secondary structure boundaries for bovine dibibliomuctase.
If the 199th codon of the rat dibibliomuctase gene has an intron in phase
1, and if the multiple sequence alignment shows that the encoded residue is
homologous to the 193rd residue of the bovine sequence, then that intron
should be designated as position 193-1, not 199-1. If the bovine protein
has a beta-strand at 185-191 and an alpha-helix at 195-211, then the
incorrect intron assignment would place the intron in the middle of the
alpha-helix, instead of where it belongs, between the beta strand and
alpha-helix. Check and double check the data (see section IV.C. BE CAREFUL
WHEN ENTERING DATA). Obtaining a low-quality result by doing a
sophisticated analysis on low-quality data is called "garbage in, garbage
out."
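
Renumbering by hand is error-prone, so it may help to let a short script do
the counting. The Python sketch below maps a codon number in one aligned
sequence onto the residue numbering of the reference protein. The function
name map_position and the two toy sequences are invented; ABaCUS itself
does not perform this step.

```python
# Illustrative sketch only: this step is done outside ABaCUS.
def map_position(query_aln, ref_aln, query_codon):
    """Map a codon number in the query onto the reference numbering.

    Both arguments are rows of the same alignment, with '-' for gaps.
    Returns None when the query codon is aligned to a gap in the reference.
    """
    q = r = 0
    for q_char, r_char in zip(query_aln, ref_aln):
        if q_char != "-":
            q += 1
        if r_char != "-":
            r += 1
        if q_char != "-" and q == query_codon:
            return r if r_char != "-" else None
    raise ValueError("query_codon lies beyond the end of the alignment")

QUERY = "MKTAYLV"      # toy sequence of the intron-containing gene
REFERENCE = "M--AYLV"  # toy reference protein, two residues shorter
print(map_position(QUERY, REFERENCE, 4))
```

In the toy alignment, codon 4 of the query corresponds to residue 2 of the
reference, in the same way that codon 199 of the rat gene corresponds to
residue 193 of the bovine protein in the example above. The phase of the
intron is carried over unchanged.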
III.B. STAGE 2: ENTER THE OBSERVED DATA AND SAVE THEM TO FILES
NOTE: Before you start ABaCUS, make sure that the relevant crystal
structure file (if necessary) and the executable file or files (in DOS,
look for "abacus.exe" and either "egavga.bgi" or another appropriate BGI
graphics driver) are all in the same directory. Also, have ready the lists
of intron positions and structural boundaries. To launch ABaCUS, type
"abacus".
NOTE: The files created in this step should be kept in the same directory
as abacus.exe. They can then be read back at any time. It's a good idea to
keep a list of the file names and a description of what each file contains,
unless you are running ABaCUS within a console (e.g., DOS in Windows) and
can examine files from the shell without quitting ABaCUS.
III.B.1. Enter the observed intron positions, then save the intron
positions to a file with the ".int" extension. Enter the inferred
ancestral intron positions, then save the resulting inferred ancestral
*exon sizes* to a file with the ".exn" extension. To find out more about
intron positions and exon sizes, and why they are treated separately, see
section IV.A. INTRONS, EXONS AND INFERRED ANCESTRAL EXONS.
III.B.2. Enter the boundaries of structural elements, then save them to a
file with the ".arr" extension. If desired, convert the maximum penalty in
the scoring array to a different value, then save the converted array with
a different name. Repeat this process for each different type of
structural element that is being considered. For more information, see
section IV.B. on arrays.
III.B.3. Attempt to load the crystal structure file. If there is no
apparent problem, check the crystal structure by viewing its diagonal plot
(if you have the DOS graphics version), or by comparing the cryptic output
file "calpha.xyz" (which contains only the CA lines) with the original
file. If any discrepancy is noted, see section IV.D. on loading atomic
coordinates from a pdb file, and correct any problems before continuing.
III.C. STAGE 3: EVALUATE CORRESPONDENCES
The intron-based analyses in III.C.1 and III.C.2 below should ideally be
done together (in either order), since the same set of reference intron
positions can then be used for both analyses (this is what was done in the
tutorial exercise). The exon-based analysis, section III.C.3, can be done
before or after the intron-based analyses.
III.C.1. Evaluate intron positions with respect to structural elements:
a. load the observed set of intron positions;
b. generate reference sets of uniform or PIID introns;
c. load the structural element scoring array;
d. score the introns using the scoring array;
e. evaluate the scores.
Repeat steps c-e as required for other types of structural
elements (there is no need to generate a new set of reference intron
positions for each analysis).
III.C.2. Evaluate intron positions with respect to centrality. You will
be prompted in step (d) to answer whether the protein has multiple globular
domains and, if the answer is "yes", you will be prompted to supply the
number and boundaries of the globular domains. Steps (a) and (b) will not
be necessary if they have already been performed:
a. load the observed set of intron positions;
b. generate reference sets of uniform or PIID introns;
c. load the crystal structure;
d. score the introns by centrality;
e. evaluate the scores.
III.C.3. Evaluate the extensity of exon-encoded peptides.
a. load the inferred ancestral exon sizes;
b. generate reference sets of lognormal or permuted exon sizes;
c. load the crystal structure (if not already loaded);
d. score the exons by extensity of exon-encoded peptides;
e. evaluate the scores.
III.C.4. Save the results. If any experiments have been performed,
choosing "q = quit" will give you the option of saving a numbered list of
experiment summaries to disk. Name the file using the ".sum" extension.
==========================================================================
IV. DETAILED COMMENTS
==========================================================================
IV.A. INTRONS, EXONS AND INFERRED ANCESTRAL EXONS
IV.A.1. How intron positions are handled.
Intron positions are entered by the user in codon-phase notation (Dibb and
Newman, 1989) and are then transformed to a scale of nucleotides, such that
the intron is given the number of the gene nucleotide that precedes it. The
formula is thus:
position = 3 * (codon - 1) + phase
For example, if the gene is 146 codons long, then it has 438 nucleotides
and 437 possible intron positions. Intron 68-0 (codon-phase) is at position
201 (bp scale). Thus, the intron positions used by ABaCUS exactly preserve
the information entered by the user.
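The transformation and its inverse can be sketched in a few lines of
Python. This is only an illustration of the formula above, not the ABaCUS
source code; the function names are hypothetical.

```python
# Illustrative sketch only; function names are hypothetical, not ABaCUS code.

def to_position(codon, phase):
    """Codon-phase notation -> number of the nucleotide preceding the intron."""
    if phase not in (0, 1, 2):
        raise ValueError("phase must be 0, 1 or 2")
    return 3 * (codon - 1) + phase


def to_codon_phase(position):
    """Inverse transformation: nucleotide-scale position -> (codon, phase)."""
    return position // 3 + 1, position % 3


print(to_position(68, 0))    # 201, the example given above
print(to_codon_phase(201))   # (68, 0)
```

Because the mapping is one-to-one, converting back and forth loses no
information.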
IV.A.2. How exon sizes are handled.
Exon sizes are only used in conjunction with a crystal structure for
evaluating the extensity of exon-encoded peptides. By contrast to intron
positions, exon sizes are always rounded to integral numbers of codons,
such that a partial codon is assigned entirely to the 5' exon. Therefore,
if the first intron in a gene is at position 38-0, the first exon will be
37 codons long, but if the first intron is at 38-1 (or 38-2 or 39-0), the
first exon is considered to be 38 codons long.
IV.A.3. Why exons and introns are handled differently.
The reason that exon sizes are NOT in bp, but in codons, is that the
exon-based scoring done by ABaCUS utilizes the atomic coordinates of
alpha-carbons. Each exon must be found to correspond to a unique set of
alpha-carbons, and thus no resolution is to be gained by expressing exon
sizes in bp. Using integral numbers of codons also simplifies several
procedures, especially the generation of lognormally distributed exon
sizes.
For the case of intron positions, using a nucleotide scale allows
potentially useful resolution with regard to the boundaries of structural
elements: for instance, if there is a helix encoded by codons 9 to 16, then
there is a non-arbitrary (though possibly trivial) sense in which introns
at 9-0 and 17-0 DO NOT interrupt the helix, whereas introns just at 9-1 and
16-2 DO interrupt the helix. By contrast, in deciding how exons correspond
to sets of C-alpha carbons, we can only make an arbitrary choice about
whether 9-0 and 9-1 both separate residue 8 from residue 9, or whether 9-1
should be treated as though it separates residue 9 from residue 10.
IV.A.4. Prohibited exon sizes.
Any listing of intron positions is allowable, as long as the positions are
entered consecutively and they do not fall outside the stated boundaries of
the gene. However, some allowable configurations of intron positions
cannot be converted by ABaCUS into exon sizes, since exon sizes must be
whole numbers. For instance, if the user enters intron positions at 245-1
and 246-0, the exon sizes will not be calculated correctly, since both of
these introns would (by the rule described above in IV.A.2) separate
residue 245 from 246. If the exon size cannot be resolved as a whole
number, then the user must change the set of intron positions accordingly.
In this case, the solution would be to combine the two intron positions,
and enter the average of the two values. The evolutionary rationale for
doing this is explained in the next two sections.
IV.A.5. Inferred ancestral exon sizes.
According to the exon theory of genes, introns are lost but not gained.
Therefore, each intron position is thought to represent an intron that
physically existed when the gene was first assembled billions of years ago.
In addition, each intron position has a unique set of scores for any
conceivable correspondence metric, and the scores are unaffected by other
intron positions (i.e., the score for position X is 5, whether or not there
is another intron at position Y). Consider some of the cytochrome C
introns listed earlier:
Arab., Aspergillus 65-0
rice 74-0
Aspergillus 96-2
The same ultimate conclusion would result if we analyzed each intron
separately, and then combined the data, or if we list all of the introns
together, since the positions are still the same (65-0, 74-0 and 96-2).
Exon sizes are not like this. For instance, the real cytochrome C gene of
Aspergillus has an exon extending from the first nt of codon 65 to the
second nucleotide of codon 96. According to the exon theory of genes, the
introns flanking this exon must have existed in the ancestral gene, but the
exon did not necessarily exist in the ancestral gene. Instead, because an
intron is found in rice at position 74-0, the observed exon from 65 to 96
in Aspergillus would NOT have been in the ancestral gene (according to the
exon theory of genes), but would have been divided by an intron at position
74-0. By combining the intron positions from various genes, we infer a
hypothetical set of ancestral exon sizes. In this case, there are no real
exons to correspond to any of the inferred ancestral exons (for instance,
the gene from rice has an exon extending from the first nucleotide of codon
74 to the end of the gene, but in the inferred ancestral gene this would be
broken by the intron at position 96-2). This is why these exon sizes are
referred to as *hypothetical* or *inferred* ancestral exon sizes.
IV.A.6. Intron "sliding".
Advocates of the exon theory of genes maintain that intron positions within
a few codons of each other must represent the same ancestral position that
has migrated, or "slid", to different positions in descendant genes.
Suppose that we find an intron position in cytochrome C at position 75-2,
just 5 nt away from the intron position at 74-0 in rice cytochrome C.
According to the exon theory of genes, the ancestral gene did NOT contain
an exon extending from 74-0 to 75-2, and including an exon of this size in
an analysis would therefore not be consistent with the assumptions of the
exon theory of genes. Instead, the ancestral gene is posited to have had a
single intron position represented by both of the extant positions at 74-0
and 75-2.
Invoking "sliding" creates two problems. First, how does one decide when
introns are too close to have co-existed in the ancestral gene? Second,
given a criterion for the first problem, how does one decide on the
position of an ancestral intron that may have left descendants at
non-identical positions? In passing we note that (based on our own
preliminary analyses) intron positions probably do not exhibit non-random
clustering patterns (for an intuitive look at this problem, see section
IV.E.3.a on the reference model of uniform intron positions), therefore no
criterion of closeness can be justified. Because of this, the whole issue
of "sliding" is probably a non-issue based on a non-phenomenon: either
"sliding" is so rampant that all clusters are dispersed to non-significant
levels, or it is so rare that significant clusters of intron positions do
not arise.
Nevertheless, in order to test the exon theory of genes, one must proceed
in a manner that is consistent with its assumptions (even if they cannot be
justified on prior grounds), and this means invoking "sliding" to explain
away any excess intron positions. The rigorous way to do this is to pick a
precise rule and stick with it. Our rule is to consider all cases of intron
positions within 3 codons of each other as cases of "sliding", and to
estimate the position of the ancestral intron by taking the average of the
extant intron positions.
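A sketch of such a rule is given below. It is not the procedure used
inside ABaCUS (the user combines the positions by hand before entering
them), and it rests on assumptions of ours: "within 3 codons" is taken to
mean 9 nt or fewer, each intron is compared with the last member of the
current cluster, and a fractional average is rounded half up.

```python
# Illustrative sketch only; the threshold, the chaining and the rounding
# are assumptions, and merge_slid_introns() is a hypothetical name.

def merge_slid_introns(positions, max_gap_nt=9):
    """positions: intron positions on the nucleotide scale.
    Introns closer than or equal to max_gap_nt are treated as one
    ancestral intron, placed at the average of the extant positions."""
    clusters = []
    for p in sorted(positions):
        if clusters and p - clusters[-1][-1] <= max_gap_nt:
            clusters[-1].append(p)       # same ancestral intron ("sliding")
        else:
            clusters.append([p])         # a new ancestral intron
    # average each cluster, rounding half up to a whole nucleotide
    return [int(sum(c) / len(c) + 0.5) for c in clusters]


# 65-0, 74-0, 75-2 and 96-2 on the nucleotide scale:
print(merge_slid_introns([192, 219, 224, 287]))   # [192, 222, 287]
```

Here the positions at 74-0 and 75-2 are replaced by a single inferred
position at 222 (that is, 75-0).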
Note well that the need to invoke sliding would only arise when performing
tests directly on exon sizes, not on intron positions. Even if "sliding"
occurs, it is a more conservative test of intron positions to include all
of the observed data than to use an additional assumption to amalgamate
some of the observed data into hypothetical ancestral data.
IV.B. ARRAYS: CREATING, CONVERTING, SAVING AND LOADING
IV.B.1. Creating an array.
The scoring arrays used by ABaCUS are linear arrays of integer penalties
associated with each possible intron position in a gene. The penalties are
assigned based on protein structural elements defined by the user. For
example, consider an imaginary protein of 20 amino acids. This protein
would be encoded by a gene with 20 codons, or 60 nucleotides. Thus, there
would be 59 inter-nucleotide positions at which an intron might be found.
Suppose that the protein has two alpha helices, one encompassing residues
3-12 and the other residues 13-19. Entering these boundaries into ABaCUS
will produce the following array of scores for each intron position:
00000011111111111111111111111111111011111111111111111111000
The array can be used as it is to score correspondences. Imagine that
there are introns at positions 5-0, 13-2 and 16-0. When scored by the above
array, each of these introns would be assigned a score of 1 (introns in
codons 1, 2 and 20 would receive a score of 0, as would introns at positions
3-0 and 13-0).
IV.B.2. Converting an array.
For most purposes, the array will be converted using a different maximum
penalty (i.e., greater than 1), which is done with the "c=convert" function
in the array submenu. With a maximum score of 9, the array shown in IV.B.1
would look like this:
00000012345678999999999999987654321012345678999987654321000
Using this array to score introns would be equivalent to deciding that the
score for an intron will be the distance to the nearest inter-element
region, up to a maximum of 9 bp (3 codons).
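The creation and conversion of the example array can be sketched as
follows. This is an illustration, not the ABaCUS code; the function names
are hypothetical, and treating the ends of the gene as inter-element
boundaries is an assumption of the sketch.

```python
# Illustrative sketch only; make_array() and convert_array() are
# hypothetical names, not ABaCUS functions.

def make_array(n_codons, elements):
    """elements: (first_residue, last_residue) pairs.
    Returns one 0/1 penalty per intron position 1 .. 3*n_codons - 1."""
    array = [0] * (3 * n_codons - 1)
    for first, last in elements:
        # positions strictly inside the element interrupt it
        for pos in range(3 * (first - 1) + 1, 3 * last):
            array[pos - 1] = 1
    return array


def convert_array(array, max_penalty):
    """Penalty = distance (nt) to the nearest zero entry, up to max_penalty.
    Assumption: the ends of the array count as zero entries."""
    n = len(array)
    dist = [0] * n
    run = 0
    for i in range(n):                     # distance to the zero on the left
        run = 0 if array[i] == 0 else run + 1
        dist[i] = run
    run = 0
    for i in reversed(range(n)):           # distance to the zero on the right
        run = 0 if array[i] == 0 else run + 1
        dist[i] = min(dist[i], run)
    return [min(d, max_penalty) for d in dist]


binary = make_array(20, [(3, 12), (13, 19)])
converted = convert_array(binary, 9)
print("".join(map(str, binary)))
print("".join(map(str, converted)))

# Scores for introns at 5-0, 13-2 and 16-0 (positions 12, 38 and 45):
print([converted[p - 1] for p in (12, 38, 45)])   # [6, 2, 9]
```

The two printed strings should match the arrays shown in IV.B.1 and
IV.B.2.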
IV.B.3. Saving, viewing and loading arrays.
Array files can be saved and loaded by ABaCUS. The view command displays
the array currently in memory, and calculates the average score for the
array. These procedures are simple and require no further explanation.
IV.C. BE CAREFUL WHEN ENTERING DATA
ABaCUS has some smart menu handling features, in that it usually does not
carry out nonsense operations in response to menu choices by the user. For
instance, ABaCUS will not allow an attempt to draw a diagonal plot unless a
crystal structure resides in memory. Likewise, when responding to the
"t=TEST" command, ABaCUS will give an error message if no set of gene data
is ready to test; if one set of data is ready, ABaCUS will test that set;
if both sets are ready, ABaCUS will prompt the user for a choice.
However, ABaCUS does not trap nonsense when the user is entering data on
intron positions and boundaries of structural elements, or when the user is
supplying parameters. For instance, if the user enters intron positions in
non-consecutive order, this will create nonsense in downstream events.
Likewise, if the boundaries of structural elements entered by the user are
inverted, this will create nonsense in downstream events.
For these reasons, it is recommended that the user enter all data and save
them to files well before attempting to perform an analysis. Immediately
after entering data, view the data using the appropriate v=view function,
check for obvious errors, then save the data to disk. Check the resulting
file for errors before proceeding with an analysis. Carefully record the
number of codons for a gene. Be sure that sets of intron positions, sets
of exon sizes, arrays, and atomic coordinates all match exactly in length.
IV.D. LOADING ATOMIC COORDINATES FROM A PDB FILE
Some PDB files can be read directly by the program, but some of them have
to be edited. Specifically, ABaCUS will choke in the following cases:
a) for multi-subunit crystal structures, due to the optional "subunit"
field, which contains a single letter ("A" or "B", for instance). Delete
the data for one of the subunits, then remove the subunit designator from
the remaining lines (i.e., use a text editor to search for " A " and
replace it with " ").
b) when the third and fourth fields run together due to long descriptors
for alternative side chain conformations. The solution to this uncommon
problem is to separate the fields by inserting spaces.
The file reader only extracts data from the "CA" lines, for C-alpha
carbons. If the crystal structure has been read incorrectly, this should
be obvious in the distance plot. If necessary, troubleshoot the editing
process by looking at the cryptic output file "calpha.xyz": this file
(rewritten each time a crystal structure is entered) echoes the information
from the crystal structure file that ABaCUS has read and successfully
stored in memory. Note that for its internal use, ABaCUS renumbers the
residues in the order they are read. The output file will retain the
numbering in the original, even if it is non-consecutive. For a 10- to
50-fold savings in disk space, throw out the successfully read PDB file and
replace it with ABaCUS's version of the file (be sure to rename it, to
anything other than "calpha.xyz", or it will be overwritten by ABaCUS).
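The editing described in case (a) can also be done with a short script
instead of a text editor. The sketch below is not part of ABaCUS. It
assumes the standard fixed-column layout of PDB ATOM records (atom name in
columns 13-16, chain in column 22, coordinates in columns 31-54); all
function names and the demonstration records are made up.

```python
# Illustrative sketch only; not the ABaCUS file reader.

def keep_one_chain(lines, chain):
    """Drop ATOM records of other chains; blank the chain field (column 22)."""
    kept = []
    for line in lines:
        if line.startswith("ATOM"):
            if line[21] != chain:
                continue
            line = line[:21] + " " + line[22:]
        kept.append(line)
    return kept


def calpha_coordinates(lines):
    """Return (x, y, z) for every C-alpha ATOM record, in file order."""
    coords = []
    for line in lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords.append((float(line[30:38]),
                           float(line[38:46]),
                           float(line[46:54])))
    return coords


def atom_record(serial, name, residue, chain, number, x, y, z):
    """Build a made-up ATOM record in fixed columns (demonstration only)."""
    return (f"ATOM  {serial:5d} {name:<4s} {residue:3s} {chain}{number:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


demo = [
    atom_record(1, " N", "GLY", "A", 1, 0.0, 0.0, 0.0),
    atom_record(2, " CA", "GLY", "A", 1, 1.458, 0.0, 0.0),
    atom_record(3, " N", "ALA", "A", 2, 3.9, 1.0, 0.0),
    atom_record(4, " CA", "ALA", "A", 2, 5.1, 1.2, -0.3),
    atom_record(5, " CA", "GLY", "B", 1, 20.0, 20.0, 20.0),
]
cleaned = keep_one_chain(demo, "A")
print(calpha_coordinates(cleaned))
```

Inspect the cleaned file by eye before loading it, as recommended above
for any edited PDB file.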
IV.E. GENERATING REFERENCE GENE DATA
IV.E.1. Why not "null" gene data instead of "reference" gene data.
Speaking of a "null" hypothesis tends to imply that there is a single
standard of nothingness or randomness against which the world can be judged
to determine its somethingness or non-randomness. The words "null" and
"random" tend to obscure the fact that a "null" or "random" model often
involves complex assumptions, such as the complex reference models used by
ABaCUS. Speaking of a reference model (instead of a "null" model) implies
that we must be acutely concerned as to whether the form of the model and
the parameters chosen are appropriate to serve as a reference for testing
the sort of thing that we are interested in testing.
IV.E.2. Logic of reference models.
Reference models are used to generate sets of reference genes that have
some of the properties of the observed data (e.g., same distribution of
exon sizes). For the case of ABaCUS, the most important aspect of the
reference algorithms is that they do not employ information on the protein
structure. That is, imagine that I launch ABaCUS, input intron positions
for my favorite gene, and then generate reference intron positions by one
of several models. Since I haven't entered any other data, ABaCUS knows
nothing about the protein structure, and therefore I can rest assured that
the introns will be placed randomly with regard to the structure of the
protein.
If the reference model accurately reflects the important properties of the
observed intron data, then the resulting reference hypothesis has the
following form:
THE OBSERVED SET OF INTRON POSITIONS (or exon sizes) DOES NOT CORRESPOND TO
PROTEIN STRUCTURE BETTER THAN IS EXPECTED AT RANDOM, GIVEN THE PROPERTIES
OF THE OBSERVED POSITIONAL DISTRIBUTION OF INTRONS (observed size
distribution of exons)
IV.E.3. Implementation of Reference models.
Once an observed set of introns is in memory, reference sets of intron
positions can be generated; once a set of observed (or inferred ancestral)
exons are in memory, reference sets of exons can be generated. The
"r=REFERENCE" genes submenu calls five generators. Each reference gene
generator can create a user-specified number of reference genes, each of
which is a set of either J intron positions (bp) or K exon sizes (in
codons), where J and K are the numbers of observed introns and hypothetical
exons currently in memory, respectively. The user may specify hundreds or
thousands of sets of reference exons or introns at a time (see IV.E.4
regarding the number of sets to choose). Output from the reference gene
generators may be saved by the user, as described in section V.C.5.
Descriptions of the reference models (IV.E.3.a through IV.E.3.e).
IV.E.3.a. Uniform random introns.
This function creates sets of uniformly distributed introns. The minimum
distance between introns in a set is 1 bp (i.e., no position is chosen
twice in a single set), unless the user specifies a higher number. The
option to change the minimum distance is useful for gaining an intuitive
sense for the random likelihood of closely-spaced introns-- some authors
have claimed that introns within a few bp of each other must have arisen by some
special process of intron "sliding", but this is not true. The screen
display, which shows the number of attempts needed to complete each set,
reveals how very often a randomly distributed intron falls 0, 1, 2, 3, etc.
positions away from a previously existing intron.
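The idea can be sketched as a simple rejection procedure. This is our own
illustration and not the ABaCUS generator (which uses a different random
number generator, see V.B); the function name, the attempt counter and the
safety limit on attempts are assumptions.

```python
# Illustrative sketch only; uniform_introns() is a hypothetical name.
import random


def uniform_introns(n_introns, n_codons, min_distance=1, rng=random,
                    max_attempts=1_000_000):
    """Draw intron positions uniformly from 1 .. 3*n_codons - 1, rejecting
    any candidate closer than min_distance to one already chosen.
    Returns (sorted positions, number of attempts needed)."""
    n_positions = 3 * n_codons - 1
    if (n_introns - 1) * min_distance + 1 > n_positions:
        raise ValueError("gene too short for this many introns")
    chosen, attempts = [], 0
    while len(chosen) < n_introns:
        attempts += 1
        if attempts > max_attempts:
            raise RuntimeError("could not complete the set")
        candidate = rng.randint(1, n_positions)
        if all(abs(candidate - p) >= min_distance for p in chosen):
            chosen.append(candidate)
    return sorted(chosen), attempts


rng = random.Random(1994)               # arbitrary seed, for repeatability
positions, attempts = uniform_introns(10, 146, min_distance=4, rng=rng)
print(positions, attempts)
```

Any excess of attempts over the number of introns counts the candidates
that fell too close to an intron already placed.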
IV.E.3.b. Introns by permuted inter-intronic distances.
This function temporarily converts the observed set of intron positions
into a set of inter-intronic distances, permutes these numbers randomly to
generate random sets, then converts them back into intron positions. As
with the function for permuting exons, large numbers of simulations should
not be done from small numbers of intron positions (e.g., fewer than 10).
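A minimal sketch of this permutation follows. It is not the ABaCUS code.
In particular, we assume here that the segments before the first intron
and after the last intron are permuted along with the distances between
introns; the function name is hypothetical.

```python
# Illustrative sketch only; the treatment of the two end segments is an
# assumption, and permuted_introns() is a hypothetical name.
import random


def permuted_introns(positions, n_codons, rng=random):
    """Shuffle the distances between consecutive introns (and gene ends),
    then rebuild intron positions on the nucleotide scale."""
    points = [0] + sorted(positions) + [3 * n_codons]
    gaps = [b - a for a, b in zip(points, points[1:])]
    rng.shuffle(gaps)
    new_positions, total = [], 0
    for gap in gaps[:-1]:                 # the last gap runs to the gene end
        total += gap
        new_positions.append(total)
    return new_positions


rng = random.Random(7)                    # arbitrary seed
observed = [192, 219, 287]                # 65-0, 74-0 and 96-2
print(permuted_introns(observed, 104, rng))   # 104 codons is a made-up length
```

Each reference set has exactly the same collection of distances as the
observed set, only in a different order.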
IV.E.3.c. Lognormal exon sizes.
This function creates random exons with the same lognormal mean and
standard deviation as the observed set of exons in memory. Since most such
sets of exons will not add up to the length of the observed gene, and since
this condition is necessary, most sets of lognormal exons are discarded (as
will be apparent from the display shown by this function). Imposing this
condition might (one would suspect) distort the resulting distribution from
its intended form, but no significant deviations are detectable in
statistical tests.
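The discard-and-retry logic can be sketched as below. This is an
illustration, not the ABaCUS generator: the rounding to whole codons, the
minimum exon size of one codon, the limit on attempts and the function
name are all assumptions of the sketch.

```python
# Illustrative sketch only; lognormal_exons() is a hypothetical name.
import math
import random
import statistics


def lognormal_exons(observed, rng, max_attempts=1_000_000):
    """Draw exon sizes (codons) from a lognormal distribution with the same
    log-scale mean and standard deviation as the observed sizes, keeping
    only a set whose sizes sum to the observed gene length.
    Returns (sizes, number of sets tried)."""
    logs = [math.log(size) for size in observed]
    mu, sigma = statistics.mean(logs), statistics.stdev(logs)
    target, k = sum(observed), len(observed)
    for attempt in range(1, max_attempts + 1):
        sizes = [max(1, round(rng.lognormvariate(mu, sigma)))
                 for _ in range(k)]
        if sum(sizes) == target:          # most sets fail this test
            return sizes, attempt
    raise RuntimeError("no acceptable set found")


rng = random.Random(0)                    # arbitrary seed
observed = [37, 9, 22, 36]                # made-up exon sizes, 104 codons
sizes, tried = lognormal_exons(observed, rng)
print(sizes, "accepted after", tried, "sets")
```

The count of sets tried corresponds to the discards reported in the
display shown by this function.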
IV.E.3.d. Permuted exon sizes.
This function creates successive random permutations of the observed order
of exon sizes ('successive' meaning that each permutation is generated from
the previous one, rather than from a common parental order). This
reference model is the one used by Gilbert and Glynias (1994). Large
numbers of simulations (>>100) should not be done from small numbers of
exon sizes (e.g., fewer than 10), or the generation of identical and
nearly-identical orders of exon sizes in different replicate exon sets will
reduce the expected statistical reliability of the final result. If the
number of exon sizes is large, this is a good non-parametric reference
model.
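The meaning of "successive" and the reason for the warning about small
sets can be seen in a short sketch. This is not the ABaCUS code, and the
function name is hypothetical.

```python
# Illustrative sketch only; successive_permutations() is a hypothetical name.
import math
import random


def successive_permutations(exon_sizes, n_sets, rng):
    """Each reference set is a shuffle of the previous set, not of the
    original observed order."""
    current = list(exon_sizes)
    sets = []
    for _ in range(n_sets):
        rng.shuffle(current)              # permute the previous order in place
        sets.append(list(current))
    return sets


rng = random.Random(5)                    # arbitrary seed
observed = [37, 9, 22, 36]                # made-up exon sizes
sets = successive_permutations(observed, 1000, rng)
distinct = len({tuple(s) for s in sets})
print(distinct, "distinct orders among", len(sets), "sets;",
      math.factorial(len(observed)), "orders are possible")
```

With only four exons there are at most 24 distinct orders, so a thousand
reference sets necessarily contain many repeats.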
IV.E.3.e. Exponential exon sizes.
This function creates exponentially distributed exon sizes, with the option
for low-end censoring. It is not recommended in most cases, since in most
cases the observed distribution of inter-intronic distances will not be
exponential. In particular, if "intron sliding" has been invoked (see
section IV.A.6), an exponential distribution is invalid unless low-end
censoring is applied to screen out any inter-intronic distances that would
be prohibited in the observed set by the "sliding" rule (e.g., the use of
an exponential distribution by Gilbert and Glynias, 1994, is invalid for
this reason, among others). Even with censoring invoked, the distribution
of inter-intronic distances is usually much more like a lognormal
distribution than an exponential one, unless there are large numbers of
intron positions known for the gene (e.g., as in the case of GAPDH).
IV.E.4. Number of reference sets to generate.
The number of reference sets to generate is based on the desired accuracy
of the resulting P value, and is strictly limited by memory availability
when running in the DOS environment.
IV.E.4.a. Accuracy of the P value. The P value is expected to have
binomial variance, i.e.,
V = P * (1 - P) / (N - 1).
Imagine that 100 simulations have been done and two correspondence rules
have been tested. Since only two tests have been done, the 5% critical
level is applicable. Suppose that one test gives a P value of P = 5/100 =
5%, the other of P = 20/100 = 20%. These P values carry uncertainty: their
expected 95% confidence intervals are +/- 0.044 and +/- 0.080,
respectively. One may be confident that the second result (P = 20%) is not
significant (i.e., it is extremely unlikely that this P value is really <
0.05). However, how does one interpret the first P value? It could be
less than 1% (very significant!) or more than 9% (not significant at all!).
In such a case, one cannot make a reliable judgment about the status of the
reference hypothesis, because the P value itself carries too much
uncertainty. If 1000 simulations are performed instead, then the
probability might be found to have a more exact value of 0.043 or 0.078 or
0.061 or 0.036. In each of these cases the expected uncertainty shrinks to
between +/- 0.012 and +/- 0.017, so that the relationship of the P value to
the 5% critical level, either higher (0.061, 0.078) or lower (0.036, 0.043),
is much more reliable (values closest to 0.05 may still warrant additional
simulations).
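The uncertainties quoted above follow from the variance formula. The
sketch below assumes that the interval is taken as plus or minus two
standard deviations, which reproduces the quoted values; the function name
is hypothetical.

```python
# Illustrative sketch only; the factor of two standard deviations is an
# assumption that reproduces the values quoted in the text.
import math


def p_value_half_width(p, n, n_sd=2.0):
    """Half-width of the expected interval around a simulated P value,
    using the binomial variance V = P * (1 - P) / (N - 1)."""
    return n_sd * math.sqrt(p * (1.0 - p) / (n - 1))


for p, n in [(0.05, 100), (0.20, 100), (0.05, 1000)]:
    print(f"P = {p:.2f}, N = {n:4d}: +/- {p_value_half_width(p, n):.3f}")
```

A tenfold increase in the number of simulations narrows the interval by
roughly a factor of three.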
IV.E.4.b. Memory limitations. Practical memory limitations are not an
issue except in the DOS environment (especially 286-based machines). The
startup screen displays how much of the DOS standard 640 K block is
available for simulations, and makes an approximate calculation of the
total number of simulations that can exist in memory (exon sets and intron
sets combined) at any time. Regular users who wish to generate more than
one thousand sets of reference genes with more than ca. 40 introns or exons
per set should move to a non-DOS environment. DOS weenies can re-compile
ABaCUS without the graphics and with kMaxNumValues set to 1 + X (where X is
the maximum number of exons or introns needed per set) to maximize the
number of simulations possible.
IV.F. SCORING CORRESPONDENCES
IV.F.1. Types of rules. There are three general models for evaluating
correspondences:
1. Centrality of intron-associated residues.
2. Distance of intron positions to inter-element regions.
3. Extensity of exon-encoded peptides.
The centrality and distance scores are assigned directly to intron
positions, while the third type of score (extensity) is assigned directly
to exons. Centrality scores and extensity scores are based on measurements
of atomic coordinates-- thus they require a crystal structure. The
distance scores are based on structural elements defined by the user.
IV.F.2. Common features of correspondence rules.
For ABaCUS, a "gene" is a set of exon sizes or intron positions. For all
types of scoring rules, the score assigned to a gene is the average score
for the intron positions or exon sizes in the gene. For all types of
rules, a lower score indicates greater conformity to the expectations of
Blake's conjecture (Blake, 1978) or the exon theory of genes as developed
by Go, Gilbert, and others (see references in Stoltzfus, et al., 1994).
IV.F.3. Centrality scores.
Centrality scoring is done by choosing "c=centrality" from the "a=ATOMIC
COORDINATES" submenu. Any observed or reference introns are scored using
the crystal structure in memory and a user-designated choice of scoring
rule. The lowest scores are achieved by centrally located
introns/residues. The scoring schemes implemented for centrality scoring
are:
1. intron score = percentage of pairwise distances > cutoff;
2. intron score = average of all pairwise distances;
3. intron score = maximum of all pairwise distances;
4. intron score = distance from center of mass of domain.
The first rule is somewhat similar to the intuitive rule used by Go (1981)
in proposing the boundaries of "modules" of hemoglobin. The second rule is
similar to the rule implied by Figure 1 of Blake (1981). Stoltzfus, et al.
(1994) use only rule #4, which we feel is the definitive rule for centrality.
For multidomain proteins, you will be prompted to enter the domain
boundaries when using this rule. Specifically, the center of mass of each
domain is calculated, then introns are assigned a score equal to the
distance in Angstroms from the residue associated with the intron to the
center of mass of the domain in which it resides.
To implement centrality scores, an arbitrary choice must be made about how
to associate intron positions with residues in a crystal structure. For
ABaCUS, the residue associated with an intron is defined as the residue
encoded by a codon that is split by the intron, or that is bounded on its
5' end by the intron.
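Rule #4 and the association of introns with residues can be sketched as
follows. This is an illustration, not the ABaCUS code: it assumes that the
center of mass is the unweighted mean of the alpha-carbon coordinates, and
the function names are hypothetical.

```python
# Illustrative sketch only; equal weighting of alpha-carbons is an
# assumption, and the function names are hypothetical.
import math


def center_of_mass(coords):
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))


def centrality_scores(intron_positions, calpha, domains=None):
    """intron_positions: nucleotide-scale positions.
    calpha: list of (x, y, z), one per residue, residue 1 first.
    domains: (first_residue, last_residue) pairs; default is one domain.
    Score = distance from the intron-associated residue to the center of
    mass of the domain that contains it."""
    if domains is None:
        domains = [(1, len(calpha))]
    scores = []
    for position in intron_positions:
        residue = position // 3 + 1       # codon split or bounded 5' by intron
        first, last = next((f, l) for f, l in domains if f <= residue <= l)
        com = center_of_mass(calpha[first - 1:last])
        scores.append(math.dist(calpha[residue - 1], com))
    return scores


# Five made-up residues on a straight line, 1 Angstrom apart:
calpha = [(float(x), 0.0, 0.0) for x in range(5)]
scores = centrality_scores([1, 6], calpha)     # introns at 1-1 and 3-0
print(scores, "gene score:", sum(scores) / len(scores))   # [2.0, 0.0] 1.0
```

As for all rules, the score for the gene is the average of the scores of
its introns.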
For information on centrality plots, see section V.C.4.
IV.F.4. Distance scores.
Correspondences with regard to defined structural elements are analyzed by
using distance scores. The complete set of all possible distance scores is
stored in an array. Any number of arrays may be created by the user, to
represent secondary structures, domains, motifs, modules, etc. Introns
from the observed set and any reference sets are scored by the distance
scoring array currently in memory when this scoring option is chosen.
There is a single option in the settings menu that affects the manner in
which distance scores are calculated (see V.C.9).
In essence, one uses distance scores to detect correspondences between
points on a line and segments of the line. For instance, one may ask
whether introns in a protein-coding gene fall between or within structural
elements, or whether introns in structural RNAs fall between or within
defined regions, such as base-paired regions or exposed regions. This type
of scoring is readily adaptable to calculating the closeness or identity of
one set of points on a line with another set of points (e.g., how closely
does one set of introns match another set?).
IV.F.5. Extensity scores.
Scoring by the extensity of exon-encoded peptides is done using the
"e=EXTENSITY" scores option of the atomic coordinates submenu. Five
different scoring rules are implemented, some of which depend on a
user-supplied arbitrary cutoff value in Angstroms:
b (binary) score = 1 if any distance > cutoff; else score = 0;
n (number) score = number of inter-C-alpha distances > cutoff;
a (average) score = average inter-C-alpha distance;
m (maximum) score = maximum inter-C-alpha distance;
r (radius) score = radius of gyration.
Each rule assigns scores to exons based on measurements on the atomic
coordinates of the residues encoded by each exon, using the crystal
structure in memory.
The first two rules, based on distance cutoffs, are intended as precise
versions of the inexact methods of Go (1981), Gilbert (1985, 1986) and
others, in which arguments are made based on the appearance of diagonal
plots with distance cutoffs in the range of 23-28 Angstroms. These two
rules give somewhat erratic results. The second rule is equivalent in
effect to the rule used by Gilbert and Glynias (1994; they assign to genes
the sum, rather than the average, of exon scores, but this difference would
not affect the final ranking of observed and reference scores).
Stoltzfus, et al. (1994) concentrate on the "maximum" (a.k.a. "diameter")
rule and the radius of gyration. The radius of gyration is a measure of
3-dimensional dispersion, defined simply as the root mean square distance
of alpha carbons from the center of mass of the exon-encoded peptide.
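The five rules can be sketched compactly. This is an illustration rather
than the ABaCUS code: the center of mass is assumed to be the unweighted
mean of the alpha-carbon coordinates, an exon of a single residue is
assumed to score 0, and the function names are hypothetical.

```python
# Illustrative sketch only; function names and the handling of
# one-residue exons are assumptions.
import math
from itertools import combinations


def exon_extensity(coords, rule, cutoff=None):
    """coords: (x, y, z) of the alpha-carbons encoded by one exon.
    rule: 'b', 'n', 'a', 'm' or 'r' ('b' and 'n' need a cutoff)."""
    dists = [math.dist(a, b) for a, b in combinations(coords, 2)]
    if rule == "r":                       # radius of gyration
        n = len(coords)
        com = [sum(c[i] for c in coords) / n for i in range(3)]
        return math.sqrt(sum(math.dist(c, com) ** 2 for c in coords) / n)
    if not dists:                         # single residue: no pairs
        return 0
    if rule == "b":
        return 1 if any(d > cutoff for d in dists) else 0
    if rule == "n":
        return sum(1 for d in dists if d > cutoff)
    if rule == "a":
        return sum(dists) / len(dists)
    if rule == "m":
        return max(dists)
    raise ValueError("unknown rule")


def gene_extensity(calpha, exon_sizes, rule, cutoff=None):
    """Average exon score for a gene; exon sizes are in codons."""
    scores, start = [], 0
    for size in exon_sizes:
        scores.append(exon_extensity(calpha[start:start + size], rule, cutoff))
        start += size
    return sum(scores) / len(scores)


# Three made-up residues forming a 3-4-5 right triangle (Angstroms):
triangle = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
for rule in "bnamr":
    print(rule, exon_extensity(triangle, rule, cutoff=3.5))
```

For the triangle, the maximum rule gives 5 and the average rule gives 4,
as can be checked by hand.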
IV.G. EVALUATING THE SIGNIFICANCE OF A CORRESPONDENCE
After each scoring of introns or exons, the results may be evaluated. A
set of introns (or a set of exons) in memory carries only a single set of
scores at a time, from the most recent scoring. The command "t=TEST" will
take the observed and reference scores in memory, calculate means and
standard deviations, and rank the observed score within the reference
scores. The mean of the standard deviation of exon scores within a
reference set is calculated, as well as the standard deviation of the mean
gene score.
A P value is calculated as the proportion of reference sets that score AS
LOW OR LOWER than the observed set. This P value represents the chance of
obtaining a correspondence as good or better than the one observed, if the
reference hypothesis is true. If the P value is less than 5% or 1%
(depending on the number of tests performed), then the reference hypothesis
may be false.
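The calculation of the P value amounts to a simple count, sketched here
with made-up scores (the function name is hypothetical):

```python
# Illustrative sketch only; p_value() is a hypothetical name.

def p_value(observed_score, reference_scores):
    """Proportion of reference sets scoring as low or lower than observed."""
    as_low = sum(1 for score in reference_scores if score <= observed_score)
    return as_low / len(reference_scores)


reference = [12.1, 9.8, 11.4, 10.2, 13.0, 10.9, 11.7, 12.5, 9.5, 11.0]
print(p_value(9.8, reference))    # 0.2: two of ten reference sets are <= 9.8
```

Ties count against the observed set, which makes the test conservative.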
If the scores of the reference sets are normally distributed, then the
difference between the observed and reference means (expressed in standard
deviations of the reference mean) should be related to the P value by the
normal probability function (e.g., if P = 0.05, then the observed mean
should be lower than the reference mean by about 1.64 standard deviations
of the reference mean). Scores derived by the centrality and extensity
rules are usually distributed roughly normally. However, distance scores
assigned by arrays often have a skewed distribution, especially if a low
maximum score has been used to convert the array.
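The relationship between the P value and the difference in standard
deviations can be checked with the standard normal distribution. The
sketch uses Python's statistics module (version 3.8 or later) and made-up
numbers; it illustrates the relationship only, and the function name is
hypothetical.

```python
# Illustrative sketch only; normal_p() is a hypothetical name.
from statistics import NormalDist


def normal_p(observed, reference_mean, reference_sd):
    """One-tailed P expected if reference scores are normally distributed."""
    z = (observed - reference_mean) / reference_sd
    return NormalDist().cdf(z)


print(round(NormalDist().inv_cdf(0.05), 3))   # -1.645
print(round(normal_p(9.0, 11.0, 1.2), 3))     # made-up scores
```

A large disagreement between this value and the P value obtained by
ranking suggests that the reference scores are not normally distributed.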
Note that every time the "t=TEST" command is successfully executed, a
description of a numbered experiment is stored in memory. The experiment
list in memory continues to grow with each new experiment, and it can be
saved as explained below.
IV.H. SAVING RESULTS; FURTHER ANALYSIS OF SCORES; etc
Each time the "t=TEST" command is executed, an experimental test of a
hypothesis has been performed. As a first approximation, each such test is
equally valid and therefore, in order to be rigorous, the conclusions drawn
from a set of tests should represent the results of all experiments, rather
than just "the ones that turned out right." Failure to follow this
methodological imperative tends to lead to errors in which one or a few
"significant" results from a large set of equally valid tests are singled
out for special attention. An example of this type of error can be found
in Go and Nosaka (1987) in which a subset of all available intron positions
is singled out for special comment because it shows a "significant"
correspondence.
In order to save the results of hypothesis-testing to disk, you must choose
"quit" from the main menu, and supply a name for the file to contain all
experiment summaries. The summary writer was designed to save most of the
parameters necessary to replicate each experiment (it's good to take notes,
though). Short user-supplied comments can be added to the experiment
description in memory at the time the hypothesis is evaluated, and these
comments will be written to disk when the experiment summary is saved.
Under normal conditions, ABaCUS does not save detailed reference gene
data-- it saves the mean, standard deviation and ranking of the observed
sets relative to the reference score, and the rest is thrown away. This
makes it impossible to analyze (for instance) the statistical distribution
of reference gene scores, or to ask other interesting questions, such as
"How low would an observed score have to be to rank in the lowest 5% or the
lowest 1%?". However, questions such as these CAN be addressed if the user
takes special steps to save the relevant data. There are three ways of
doing this, each of which may be desirable under different circumstances,
depending on the reason for saving the results:
1) If the reference introns or exons have been scored, the "save" function
will include the scores when it writes the intron positions or exon sizes
to disk. If the reference introns or exons have been scored and evaluated,
the means and standard deviations will also be recorded. The resulting
file can be large: a file with 1000 scored sets of reference genes, with 15
introns in each set, takes up 200 K.
2) Settings can be changed to turn on a file writer that records the mean
score for each reference set (only the mean for each set-- not the
individual exon or intron scores). See section V.C.6.
3) The user may effectively "save" reference gene data by saving its
initial conditions. See section V.C.10 for instructions on how to manually
enter a random number seed that can be used at a later date to regenerate
the same data.
IV.I. PLOTTING DIAGONAL PLOTS AND EXON PLOTS
IV.I.1. Diagonal plots.
A diagonal plot, or C-alpha-C-alpha distance map, is a 2-dimensional
contour map of a 3-dimensional protein structure, based on the pairwise
distances between alpha-carbons, plotted on cartesian coordinates. Many
diagonal plots that appear in the literature show three contours: very
short pairwise distances (e.g., < 12 Angstroms) in gray, very long pairwise
distances (e.g., > 28 Angstroms) in black, and intermediate distances in
white (e.g., Go, 1981).
IV.I.2. Exon plots.
Exon plots are like diagonal plots, but they only show the distances between
residues encoded by the same exons. The plot thus appears as a series of N
right triangles with their hypotenuses along the diagonal, where N is the
number of exons. It is possible to make exon plots of both the inferred
ancestral set of exons, and reference sets of exons. Exon plots are
sometimes useful for developing a nuts-and-bolts understanding of why
different gene structures achieve different extensity scores.
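The restriction to same-exon pairs can be sketched as a mask over the
distance map. This is a hypothetical Python illustration, not the ABaCUS
code; it assumes exon sizes are given in whole residues, so introns that
fall within a codon are not modeled.

```python
def exon_of_each_residue(exon_sizes):
    """Label every residue with the index of the exon that encodes it."""
    labels = []
    for exon_index, size in enumerate(exon_sizes):
        labels.extend([exon_index] * size)
    return labels


def exon_plot_mask(exon_sizes):
    """Return a grid that is True where residues i and j share an exon."""
    labels = exon_of_each_residue(exon_sizes)
    n = len(labels)
    return [[labels[i] == labels[j] for j in range(n)] for i in range(n)]
```

Only the cells where the mask is True would be drawn. On one side of the
diagonal these cells form one right triangle per exon, which is the
pattern described above.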
IV.I.3. Plotting options.
ABaCUS is capable of making black & white distance plots (i.e., two
contours), or color distance plots with 16 contours. For black & white
plots, a single cutoff value distinguishes close and distant inter-residue
distances. For color plots, there is a scalable relationship between the
16-color palette and the distance between residues. Also, color plots can
depict all distances (choose cutoff = 0.0 to do this), or only those
distances greater than an arbitrary cutoff value (e.g., 25 Angstroms).
Section V.C explains how to alter these settings to suit your interests.
==========================================================================
V. ADDITIONAL DETAILS
==========================================================================
V.A. HARD LIMITS ON PARAMETERS
Limits are set differently depending on whether or not the program is
compiled in DOS:
limit             DOS    non-DOS
______________    ____   _______
kMaxNameLength      14        30
kMaxArraySize     2400      4000
kMaxNumValues       26       101
The first column of values is used if Compiled_in_DOS is #defined as 1 in
the header file "abacus.h"; the second column is used when Compiled_in_DOS
is set to 0.
The experienced user may wish to alter these limits. kMaxNameLength
refers to the length of file names. kMaxArraySize refers to the scoring
arrays used in distance scoring (the DOS limit of 2400 sites, or 800
codons, should be sufficient for most purposes). kMaxNumValues is 1 + the
maximum number of intron positions or exon sizes per gene that you wish to
use.
There is no hard limit on the number of residues in a crystal structure or
on the length of the gene represented by a set of intron positions or exon
sizes.
V.B. THE RANDOM NUMBER GENERATOR
The code for the uniform random number generator at the heart of ABaCUS's
simulations is taken from p. 282 of _Numerical Recipes in C_ (Press, et
al., 1992; and references therein). This is the "ran2" long-period (about
10^18) pseudo-random number generator, described by the authors as "the
generator of L'Ecuyer with Bays-Durham shuffle and added safeguards". It
returns a uniform random deviate between 0.0 and 1.0 (exclusive of the
endpoint values).
The routines for generating uniform intron positions and exponential exon
sizes, and the routines for permuting exon sizes and inter-intronic
distances rely directly on the uniform random number generator. The
routine for generating lognormal exon sizes makes use of Box and Muller's
general method of converting uniform random deviates into normal deviates.
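The conversion from uniform deviates to lognormal exon sizes can be
sketched as below. This is an illustration under stated assumptions, not
the ABaCUS routine: it uses Python's built-in generator in place of ran2,
it shows the basic trigonometric form of the Box-Muller transform (the
manual does not say which variant ABaCUS uses), and the parameters mu and
sigma (mean and standard deviation of the log of the exon size) are
hypothetical names.

```python
import math
import random


def open_uniform(rng):
    """Return a uniform deviate that is never exactly 0.0, so log() is safe."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def box_muller(rng):
    """Turn two uniform deviates into one standard normal deviate."""
    u1 = open_uniform(rng)
    u2 = rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)


def lognormal_exon_size(rng, mu, sigma):
    """Draw one exon size whose logarithm is normal with mean mu, s.d. sigma."""
    return math.exp(mu + sigma * box_muller(rng))
```

For example, lognormal_exon_size(random.Random(1), 4.0, 0.5) draws a
single positive size whose logarithm is centered on 4.0.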
V.C. EXPLANATION OF THE SETTINGS MENU
NOTE: The defaults for these settings are hard-coded. Any changes made to
the settings are completely forgotten as soon as you quit the program. I
probably should change the name to the "options" menu instead of the
"settings" menu.
V.C.1. Toggle between color and monochrome distance plots. This is
self-explanatory.
V.C.2. Toggle between single- and double-size distance plots. Normally,
the distance plot of a protein R residues long is plotted on an R X R
plane. That is, there is one pixel representing each Cartesian coordinate
of the diagonal plot. If the "double-size" option is chosen, each
Cartesian coordinate is represented by 4 pixels-- a 2 X 2 square of pixels.
Choose this option to enhance viewing of small proteins, such as hemoglobin
or cytochrome C.
V.C.3. Change color scale for distance plotting. The color constant is a
scalar used to convert an inter-C-alpha distance into a color code. The
default value of the color constant is 2.7 and the conversion formula is
color = nextLowestIntegerValueOf( distance / colorConstant )
Each integer between 0 and 15 is associated with a color in the 4-bit color
palette, as follows:
 0=black          8=dark gray
 1=blue           9=light blue
 2=green         10=light green
 3=cyan          11=light cyan
 4=red           12=light red
 5=magenta       13=light magenta
 6=brown         14=yellow
 7=light gray    15=white
For instance, if the distance between residues X and Y is 23.5 Angstroms
and the color constant is 2.7, then the value of distance/colorConstant is
8.70, and the next-lowest integer value of 8.70 is 8. Therefore, the color
at (X,Y) on the diagonal plot will be 8=dark gray. If distance /
colorConstant > 15, a white pixel will be displayed, representing the
greatest distance class.
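The conversion can be sketched in a few lines. This is a Python
illustration of the formula and palette given above, not the ABaCUS source;
the function names are hypothetical.

```python
import math

# The 4-bit palette listed above, indexed by color code 0..15.
PALETTE = [
    "black", "blue", "green", "cyan",
    "red", "magenta", "brown", "light gray",
    "dark gray", "light blue", "light green", "light cyan",
    "light red", "light magenta", "yellow", "white",
]


def color_index(distance, color_constant=2.7):
    """Round distance / color_constant down, capping the result at 15."""
    return min(math.floor(distance / color_constant), 15)


def color_name(distance, color_constant=2.7):
    """Return the palette name for an inter-C-alpha distance in Angstroms."""
    return PALETTE[color_index(distance, color_constant)]
```

With the default constant, color_name(23.5) returns "dark gray", matching
the worked example. A larger color constant spreads the 16 colors over a
wider range of distances.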
V.C.4. Toggle on/off file with raw data for centrality plot. A graphical
representation of the centrality scores for all residues in a crystal
structure is useful in attempting to understand the meaning of this type of
scoring. The 'centrality plot' for a protein is a line graph representing
the centrality scores vs. the amino acid residue number. ABaCUS doesn't
actually make these plots, but it is capable of writing an output file with
all of the data (which can then be pasted into your favorite spreadsheet or
graphing program and used to make a centrality plot). To make the output
file, go to the settings menu and turn on the option to write centrality
scores to disk. Then load a crystal structure, and choose "c=centrality"
from the distance scoring submenu and choose the appropriate scoring
scheme, as though you were scoring a set of introns-- it doesn't matter if
there really aren't any introns in memory. A file named "cplot.sco"
containing the centrality scores for all residues in the protein will be
written to disk.
V.C.5. Change cryptic output from reference gene generators. This is for
those who wish to examine details of the distribution of reference exon
sizes or intron positions. Mainly, these options were useful when the
reference gene generators of ABaCUS were being tested for their ability to
produce the desired distributions.
V.C.6. Change cryptic output of file with distribution of scores. Once
this option is invoked, the complete distribution of reference scores (the
mean score for each reference set, not the individual exon or intron
scores) for each hypothesis that is evaluated will be appended to a file
called "nullscor.out". Each addition to the file also contains the
observed score and descriptive comments that allow the user to match the
set of scores with the experiment summary written using the summary writer.
V.C.7. Toggle between weighted and unweighted exon scores. Exon scores
will be weighted inversely by the size of the exon if this option is turned
on.
V.C.8. Toggle on/off pause to allow screen dumps of diagonal plots.
Normally, when a diagonal plot is being viewed, ABaCUS will show the plot
forever, or until the user presses a carriage return. During this time,
ABaCUS will absorb key combinations that might otherwise be used to access
an automatic screen-dumping utility such as PCXDUMP. Turning on the pause
simply puts the diagonal plot on a timer for about 20 seconds during which
a screen dump may be made before the diagonal plot disappears and the menu
reappears.
V.C.9. Treat gene edges as element edges when converting arrays. The
default settings for ABaCUS stipulate that the ends of a gene are treated
as the edges of an element. That is, if an alpha-helix includes residues
88-100 in a 100-residue protein, then an intron at (for example) position
96-1 is scored as though the nearest inter-element region lies just beyond
the end of the gene-- just beyond codon 100-- rather than just before codon
88. We recommend not changing the default setting. However, if the
alternative setting is chosen, be sure to have this option turned on *when
the array is converted* to a new maximum score, since the converter is the
function that implements this option. After the array has been converted,
it doesn't matter what the setting is at any later time when the array is
viewed, saved, loaded, or used to assign scores. An array that has been
converted with the gene-edge=element-edge option turned off may be
converted back to its original form using the c=convert function with the
option turned on. Of course, changing this option only makes a
difference in the case of proteins that have a structural element extending
to an edge (e.g., the last helix of hemoglobin chains often extends to the
very last residue of the protein).
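The effect of this option on the helix example can be sketched as below.
This is only an illustration of the distinction, in Python, and not the
scoring or conversion routine of ABaCUS: the function name and arguments
are hypothetical, positions are whole codons (the intron phase is
ignored), and the "inter-element region" is taken to begin at the first
codon outside the element.

```python
def codons_to_nearest_inter_element_region(codon, element_start, element_end,
                                           gene_length,
                                           gene_edges_are_element_edges=True):
    """Distance in codons from a site inside an element to the nearest edge.

    An element end that coincides with a gene end counts as an edge only
    when gene_edges_are_element_edges is True. Returns None if no edge
    qualifies.
    """
    candidates = []
    if element_start > 1 or gene_edges_are_element_edges:
        # Region just before the element (e.g. codon 87 for a helix at 88).
        candidates.append(codon - (element_start - 1))
    if element_end < gene_length or gene_edges_are_element_edges:
        # Region just beyond the element (e.g. "codon 101" past the gene end).
        candidates.append((element_end + 1) - codon)
    return min(candidates) if candidates else None
```

For the helix at residues 88-100 of a 100-residue protein, codon 96 is 5
codons from the region just beyond codon 100 with the default setting, but
9 codons from the region just before codon 88 with the alternative setting.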
V.C.10. Initialize random number generator with user-defined seed.
Normally, the random number generator is initialized at startup with
computer clock time (seconds elapsed since 0:00:00 Greenwich Mean Time, 1
Jan 1970) and this results in a unique set of numbers for each simulation
experiment. However, if there is a need to generate exactly the same set
of data twice, a seed may be set manually, then re-entered for a perfect
replicate. There would only be two reasons to do this: a) you are testing
the reproducibility of ABaCUS's routines to make sure there are no weird
bugs getting into them; b) you are an anally retentive type wishing to have
complete reproducibility for the purposes of record-keeping. In either
case, enter an unsigned 16-bit integer greater than 0, that is, a whole
number less than 65,536. If exactly the same conditions (seed, reference
model, number of introns/exons, gene length) are used twice, then
exactly the same set of reference genes will be generated twice.
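The principle can be demonstrated with any seeded generator. The sketch
below is in Python and is not the ABaCUS code: it uses Python's built-in
generator rather than ran2, so it will not reproduce ABaCUS's actual
numbers, and the function names and the simple uniform model are
hypothetical stand-ins.

```python
import random


def make_generator(seed):
    """Create a generator from a seed in the range the manual asks for."""
    if not 0 < seed < 65536:
        raise ValueError("seed must be a whole number from 1 to 65535")
    return random.Random(seed)


def uniform_intron_positions(rng, num_introns, gene_length):
    """Draw sorted intron positions uniformly from 1 to gene_length - 1."""
    return sorted(rng.randrange(1, gene_length) for _ in range(num_introns))
```

Calling uniform_intron_positions(make_generator(12345), 15, 300) twice
yields two identical lists, because both the seed and the conditions are
the same. Changing the seed, the number of introns, or the gene length
breaks the replication.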
V.D. HOW TO CONTACT THE PDB
Access to the Brookhaven Protein Data Bank (Bernstein, et al. 1977; Abola,
et al., 1987) is available by FTP or by Gopher (type 1, port 70, path 1/)
to pdb.pdb.bnl.gov (130.199.144.1).
==========================================================================
VI. REFERENCES
==========================================================================
Abola, E.E., et al. 1987. Protein Data Bank, pp. 107-132 in
_Crystallographic Databases - Information Content, Software Systems,
Scientific Applications_ ed. F.H. Allen, G. Bergerhoff, and R. Sievers
(Data Commission of the International Union of Crystallography, Cambridge,
1987).
Banner, D.W., et al. 1975. Nature 255: 609.
Bernstein, F.C., et al. 1977. J. Mol. Biol. 112: 535.
Blake, C.C.F. 1978. Nature 273: 267.
Blake, C.C.F. 1983. Nature 306: 535.
Dibb, N.J. and A.J. Newman. 1989. EMBO J. 8 (7): 2015.
Doolittle, W.F. 1987. Am. Nat. 130: 915.
Gilbert, W., M. Marchionni and G. McKnight. 1986. Cell 46: 151. See also
D. Straus and W. Gilbert. 1985. Mol. Cell. Biol. 5(12): 3497; and N.
Lonberg and W. Gilbert. 1985. Cell 40: 81.
Gilbert, W. and M. Glynias. 1994. Gene 135: 137.
Go, M. 1981. Nature 291: 90.
Go, M. 1983. Proc. Natl. Acad. Sci. U.S.A. 80: 1964.
Go, M. and Nosaka. 1987. Cold Spring Harbor Symp. Quant. Biol. 52: 915.
Kemmerer, E.C., M. Lei and R. Wu. 1991a. J. Mol. Evol. 32: 227.
Kemmerer, E.C., M. Lei and R. Wu. 1991b. Mol. Biol. Evol. 8(2): 212.
Press, W.H. et al. 1992. _Numerical Recipes in C_ (Cambridge Univ. Press,
London, 1992, 2nd ed.).
Raitt, D.C., R.E. Bradshaw and T.M. Pillar. 1994. Mol. Gen. Genet. 242: 17.
Stoltzfus, A., et al. 1994. Testing the Exon Theory of Genes: The
Evidence from Protein Structure. Science XXX: XXX.